DeepSeek V4 by am17an · Pull Request #24162 · ggml-org/llama.cpp

am17an · 2026-06-05T07:11:47Z

Overview

Still a WIP, lots of work to do before this is usable. At the current stage it passes long context/tool calling tests but is quite slow. All the complexity is in the new llama-kv-cache-dsv4 and the deepseekv4 model class; there are no new ggml ops at the moment.

To run the Flash version you need at least 100 GB of VRAM (you can use antirez's GGUF or use this PR to convert one); the full Flash version needs 160+ GB. Here's how I was running the server on a DGX Spark:

llama-server -m dsv4-q2_k.gguf -fa 0 -c 32768 --jinja --chat-template-file models/templates/deepseek-ai-DeepSeek-V4.jinja --fit off

Note that it is extremely slow at the moment (~4-5 toks/sec)

Thanks to @pwilkin for the correct chat template + debugging help
Thanks to @fairydreaming for his help in debugging + contributing fixes

Additional information

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES, paired with both codex and claude.

fairydreaming · 2026-06-05T11:19:19Z

@am17an I wonder what's the purpose of f32 casts and conts after mulmats here?

diff --git a/src/models/deepseek-v4.cpp b/src/models/deepseek-v4.cpp
index da3536f37..c8e17ef4e 100644
--- a/src/models/deepseek-v4.cpp
+++ b/src/models/deepseek-v4.cpp
@@ -828,11 +828,9 @@ ggml_tensor * llama_model_deepseek_v4_flash::graph::build_attention(
     ggml_tensor * hca_state_score = nullptr;
     if (ratio == DSV4_HCA_RATIO && inp_dsv4->get_hca().state_idxs) {
         hca_state_kv = build_lora_mm(layer.attn_comp_wkv, cur);
-        hca_state_kv = ggml_cont(ctx0, ggml_cast(ctx0, hca_state_kv, GGML_TYPE_F32));
         cb(hca_state_kv, "hca_state_kv", il);
 
         hca_state_score = build_lora_mm(layer.attn_comp_wgate, cur);
-        hca_state_score = ggml_cont(ctx0, ggml_cast(ctx0, hca_state_score, GGML_TYPE_F32));
         cb(hca_state_score, "hca_state_score", il);
 
         ggml_tensor * ape = layer.attn_comp_ape;
@@ -848,11 +846,9 @@ ggml_tensor * llama_model_deepseek_v4_flash::graph::build_attention(
 
     if (ratio == DSV4_CSA_RATIO && inp_dsv4->get_csa().state_idxs) {
         ggml_tensor * csa_state_kv = build_lora_mm(layer.attn_comp_wkv, cur);
-        csa_state_kv = ggml_cont(ctx0, ggml_cast(ctx0, csa_state_kv, GGML_TYPE_F32));
         cb(csa_state_kv, "csa_state_kv", il);
 
         ggml_tensor * csa_state_score = build_lora_mm(layer.attn_comp_wgate, cur);
-        csa_state_score = ggml_cont(ctx0, ggml_cast(ctx0, csa_state_score, GGML_TYPE_F32));
         cb(csa_state_score, "csa_state_score", il);
 
         ggml_tensor * csa_ape = layer.attn_comp_ape;
@@ -902,11 +898,9 @@ ggml_tensor * llama_model_deepseek_v4_flash::graph::build_attention(
         ggml_build_forward_expand(gf, csa_state_score);
 
         ggml_tensor * lid_state_kv = build_lora_mm(layer.indexer_comp_wkv, cur);
-        lid_state_kv = ggml_cont(ctx0, ggml_cast(ctx0, lid_state_kv, GGML_TYPE_F32));
         cb(lid_state_kv, "lid_state_kv", il);
 
         ggml_tensor * lid_state_score = build_lora_mm(layer.indexer_comp_wgate, cur);
-        lid_state_score = ggml_cont(ctx0, ggml_cast(ctx0, lid_state_score, GGML_TYPE_F32));
         cb(lid_state_score, "lid_state_score", il);
 
         ggml_tensor * lid_ape = layer.indexer_comp_ape;

Removed them and got the same logits.

am17an · 2026-06-05T12:34:26Z

@fairydreaming it's an artifact of debugging, you can push your changes to this branch (I added you as collaborator)

fairydreaming · 2026-06-08T16:01:00Z

Played with flash attention this weekend, here's my experimental patch:

diff --git a/src/llama-kv-cache-dsv4.cpp b/src/llama-kv-cache-dsv4.cpp
index 1737d62ae..82ab5f01f 100644
--- a/src/llama-kv-cache-dsv4.cpp
+++ b/src/llama-kv-cache-dsv4.cpp
@@ -323,6 +323,8 @@ static llama_kv_cache_dsv4_context::comp_plan dsv4_build_comp_plan(
         }
     }
 
+    plan.n_kv = GGML_PAD(plan.n_kv, 256u);
+
     if (overlap) {
         // [ all blocks' prev-window indices | all blocks' cur-window indices ]
         plan.state_read_idxs.reserve(overlap_prev_reads.size() + overlap_cur_reads.size());
@@ -686,7 +688,7 @@ llama_kv_cache_dsv4::llama_kv_cache_dsv4(
 
     kv_csa = std::make_unique<llama_kv_cache>(
             model, hparams_csa, type_k, type_v,
-            v_trans, offload, unified, dsv4_comp_size(kv_size, DSV4_CSA_RATIO), n_seq_max, n_pad,
+            v_trans, offload, unified, GGML_PAD(dsv4_comp_size(kv_size, DSV4_CSA_RATIO), 256u), n_seq_max, n_pad,
             0, LLAMA_SWA_TYPE_NONE, filter_csa, nullptr);
 
     LLAMA_LOG_INFO("%s: creating DSV4 HCA compressed KV cache, size = %u cells\n",
@@ -694,7 +696,7 @@ llama_kv_cache_dsv4::llama_kv_cache_dsv4(
 
     kv_hca = std::make_unique<llama_kv_cache>(
             model, hparams_hca, type_k, type_v,
-            v_trans, offload, unified, dsv4_comp_size(kv_size, DSV4_HCA_RATIO), n_seq_max, n_pad,
+            v_trans, offload, unified, GGML_PAD(dsv4_comp_size(kv_size, DSV4_HCA_RATIO), 256u), n_seq_max, n_pad,
             0, LLAMA_SWA_TYPE_NONE, filter_hca, nullptr);
 
     LLAMA_LOG_INFO("%s: creating DSV4 lightning-indexer KV cache, size = %u cells\n",
@@ -702,7 +704,7 @@ llama_kv_cache_dsv4::llama_kv_cache_dsv4(
 
     kv_lid = std::make_unique<llama_kv_cache>(
             model, hparams_lid, type_k, type_v,
-            v_trans, offload, unified, dsv4_comp_size(kv_size, DSV4_CSA_RATIO), n_seq_max, n_pad,
+            v_trans, offload, unified, GGML_PAD(dsv4_comp_size(kv_size, DSV4_CSA_RATIO), 256u), n_seq_max, n_pad,
             0, LLAMA_SWA_TYPE_NONE, filter_csa, nullptr);
 
     LLAMA_LOG_INFO("%s: creating DSV4 CSA compressor state\n", __func__);
diff --git a/src/models/deepseek-v4.cpp b/src/models/deepseek-v4.cpp
index 3f3b0cf92..7bde3bcff 100644
--- a/src/models/deepseek-v4.cpp
+++ b/src/models/deepseek-v4.cpp
@@ -683,6 +683,10 @@ ggml_tensor * llama_model_deepseek_v4_flash::graph::build_csa_lid_attention(
     ggml_tensor * kq_mask = ggml_concat(ctx0, raw_mask, csa_mask, 0);
     cb(kq_mask, "csa_lid_kq_mask", il);
 
+    if (cparams.flash_attn && kq_mask->type != GGML_TYPE_F16) {
+        kq_mask = ggml_cast(ctx0, kq_mask, GGML_TYPE_F16);
+    }
+
     ggml_tensor * out = build_attn_mha(q, k_all, k_all, nullptr, kq_mask, sinks, nullptr, kq_scale, il);
     cb(out, "attn_csa_lid", il);
 
@@ -740,6 +744,10 @@ ggml_tensor * llama_model_deepseek_v4_flash::graph::build_hca_attention(
     ggml_tensor * kq_mask = ggml_concat(ctx0, raw_mask, hca_mask, 0);
     cb(kq_mask, "hca_kq_mask", il);
 
+    if (cparams.flash_attn && kq_mask->type != GGML_TYPE_F16) {
+        kq_mask = ggml_cast(ctx0, kq_mask, GGML_TYPE_F16);
+    }
+
     ggml_tensor * out = build_attn_mha(q, k_all, k_all, nullptr, kq_mask, sinks, nullptr, kq_scale, il);
     cb(out, "attn_hca", il);

With FA enabled and the lightning indexer GGML op added, compute-buffer memory usage got really low. I think processing 1M tokens is achievable on a single RTX PRO 6000 Max-Q with CPU expert offloading (f16 cache type), even with an 8k ubatch size.
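The padding in the patch above rounds each size up to a multiple of 256. A minimal sketch of that arithmetic, assuming a power-of-two alignment; `pad_to` is a hypothetical stand-in written for illustration, not ggml's `GGML_PAD` macro itself:

```cpp
#include <cstdint>

// Hypothetical helper (illustration only): round x up to the next multiple of n.
// Assumes n is a power of two, which holds for the 256 used in the patch.
constexpr uint32_t pad_to(uint32_t x, uint32_t n) {
    return (x + n - 1) & ~(n - 1);
}

// Values that are already aligned stay unchanged; everything else rounds up.
static_assert(pad_to(0,    256) == 0,    "zero stays zero");
static_assert(pad_to(1,    256) == 256,  "rounds up to the first multiple");
static_assert(pad_to(256,  256) == 256,  "aligned value is unchanged");
static_assert(pad_to(1000, 256) == 1024, "rounds up to the next multiple");
```

The cost of this rounding is at most 255 extra cells per cache, which is small next to the context sizes benchmarked in this thread.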

Some performance numbers (Epyc 9374F + RTX PRO 6000 Max-Q):

$ ./bin/llama-batched-bench -m ../models/DeepSeek-V4-Flash.gguf -b 8192 -ub 8192 -npl 1 -npp 8192,16384,32768,65536,131072,262144,524288 -ntg 32 -fa 1 -cmoe --no-repack
0.00.471.019 W llama_model_loader: tensor overrides to CPU are used with mmap enabled - consider using --no-mmap for better performance

llama_batched_bench: n_kv_max = 1048576, n_batch = 8192, n_ubatch = 8192, flash_attn = 1, is_pp_shared = 0, is_tg_separate = 0, n_gpu_layers = -1, n_threads = 32, n_threads_batch = 32

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|  8192 |     32 |    1 |   8224 |   10.947 |   748.36 |    2.981 |    10.74 |   13.927 |   590.49 |
| 16384 |     32 |    1 |  16416 |   22.323 |   733.96 |    2.817 |    11.36 |   25.140 |   652.98 |
| 32768 |     32 |    1 |  32800 |   46.820 |   699.87 |    2.878 |    11.12 |   49.699 |   659.98 |
| 65536 |     32 |    1 |  65568 |  102.828 |   637.34 |    2.962 |    10.80 |  105.790 |   619.79 |
|131072 |     32 |    1 | 131104 |  240.695 |   544.56 |    3.140 |    10.19 |  243.835 |   537.68 |
|262144 |     32 |    1 | 262176 |  624.131 |   420.01 |    3.503 |     9.14 |  627.634 |   417.72 |
|524288 |     32 |    1 | 524320 | 1860.555 |   281.79 |    4.218 |     7.59 | 1864.773 |   281.17 |

49.09.116.580 W ~llama_context:      CUDA0 compute buffer size of 24476.1461 MiB, does not match expectation of 4168.0000 MiB
49.09.116.584 W ~llama_context:  CUDA_Host compute buffer size of 16900.4862 MiB, does not match expectation of 16772.1562 MiB

Max memory usage I saw in nvidia-smi was 60836MiB / 97887MiB.
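The throughput columns in the table above appear to follow directly from the token counts and timings. A small sketch under that assumption; the `bench_row` struct and the function names are hypothetical, and the batch factor `b` cannot be confirmed from these rows because B is always 1:

```cpp
// Hypothetical row type holding the inputs printed in the benchmark table.
struct bench_row {
    int    pp;   // prompt tokens per sequence (PP)
    int    tg;   // generated tokens per sequence (TG)
    int    b;    // number of parallel sequences (B)
    double t_pp; // prompt processing time in seconds (T_PP)
    double t_tg; // generation time in seconds (T_TG)
};

// Prompt processing speed in tokens per second (S_PP).
constexpr double speed_pp(bench_row r) { return r.pp * r.b / r.t_pp; }

// Generation speed in tokens per second (S_TG).
constexpr double speed_tg(bench_row r) { return r.tg * r.b / r.t_tg; }

// Overall speed in tokens per second (S), over the summed time.
constexpr double speed_total(bench_row r) {
    return (r.pp + r.tg) * r.b / (r.t_pp + r.t_tg);
}
```

For the first Flash row (PP = 8192, TG = 32, T_PP = 10.947 s, T_TG = 2.981 s) this gives roughly 748 t/s, 10.7 t/s, and 590 t/s, matching the table up to rounding of the printed timings.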

Edit: forgot about Pro benchmark results, aborted in the middle but it got to:

$ ./bin/llama-batched-bench -m ../../llama.cpp-dsv4/models/DeepSeek-V4-Pro.gguf -b 8192 -ub 8192 -npl 1 -npp 8192,16384,32768,65536,131072,262144,524288 -ntg 32 -fa 1 -cmoe --no-repack
0.00.497.037 W llama_model_loader: tensor overrides to CPU are used with mmap enabled - consider using --no-mmap for better performance

llama_batched_bench: n_kv_max = 1048576, n_batch = 8192, n_ubatch = 8192, flash_attn = 1, is_pp_shared = 0, is_tg_separate = 0, n_gpu_layers = -1, n_threads = 32, n_threads_batch = 32

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|  8192 |     32 |    1 |   8224 |   45.867 |   178.61 |    5.459 |     5.86 |   51.325 |   160.23 |
| 16384 |     32 |    1 |  16416 |   93.140 |   175.91 |    5.276 |     6.07 |   98.416 |   166.80 |
| 32768 |     32 |    1 |  32800 |  191.703 |   170.93 |    5.373 |     5.96 |  197.076 |   166.43 |
| 65536 |     32 |    1 |  65568 |  402.959 |   162.64 |    5.543 |     5.77 |  408.502 |   160.51 |
|131072 |     32 |    1 | 131104 |  883.115 |   148.42 |    5.860 |     5.46 |  888.975 |   147.48 |

fairydreaming · 2026-06-08T18:36:23Z

@am17an Any specific reason you went with DEEPSEEK_V4_FLASH/deepseek-v4-flash/deepseek_v4_flash when naming things instead of simply DEEPSEEK4/deepseek4/deepseek4? I mean this convention is a bit inconsistent with existing names and the flash part is confusing (sounds like flash only while pro uses this architecture too), maybe it would be better to change it now before it spreads? (I noticed that even the architecture name in GGUF is deepseek-v4-flash, so we'd have to update it in existing GGUF files or reconvert).

am17an · 2026-06-10T16:14:53Z

I'm going to work on making graph reuse work across various compression boundaries and also make multi-sequence work, along with fixing a couple of issues. After that I think a round of simple optimization + running some evals and then this should be ready for review.

Since it's a large PR it may make sense to separate out conversion, chat and then the model into separate PRs. In parallel #24231 + FA can be included when they're ready

fairydreaming · 2026-06-10T16:24:31Z

@am17an Sounds good, I stared at tensor values for the last few days comparing them with the DeepSeek inference code but haven't found any obvious problems.

fairydreaming · 2026-06-11T13:05:11Z

For anyone interested I have this PR with various optimizations (#24231+CUDA, #24011, FA changes) in my repo: https://github.com/fairydreaming/llama.cpp/tree/dsv4

PP is the same as reported above, TG is ~70% faster.

rujialiu · 2026-06-12T04:55:37Z

> For anyone interested I have this PR with various optimizations (#24231+CUDA, #24011, FA changes) in my repo: https://github.com/fairydreaming/llama.cpp/tree/dsv4

Thanks! I tried but failed. It looks like antirez's gguf is not yet supported?

0.00.117.161 I srv    load_model: loading model '\gguf\DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf'
0.00.230.174 E llama_model_load: error loading model: error loading model hyperparameters: key not found in model: deepseek4.swiglu_clamp_shexp
0.00.230.184 E llama_model_load_from_file_impl: failed to load model

fairydreaming · 2026-06-12T07:37:23Z

> > For anyone interested I have this PR with various optimizations (#24231+CUDA, #24011, FA changes) in my repo: https://github.com/fairydreaming/llama.cpp/tree/dsv4
>
> Thanks! I tried but failed. It looks like antirez's gguf is not yet supported?
>
> 0.00.117.161 I srv    load_model: loading model '\gguf\DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf'
> 0.00.230.174 E llama_model_load: error loading model: error loading model hyperparameters: key not found in model: deepseek4.swiglu_clamp_shexp
> 0.00.230.184 E llama_model_load_from_file_impl: failed to load model

@rujialiu Unfortunately there are multiple naming differences for model parameters and tensors that prevent antirez GGUFs from working with this PR.

fairydreaming · 2026-06-12T08:23:44Z

@am17an On the other hand, maybe it's a good idea to unify the naming with antirez GGUFs? From what I see in the files there's only a single difference in tensor shapes, in the attention output tensor: [4096, 1024, 8] vs [4096, 8192, 1]. I can try to fix this in the meantime, what do you think?

rujialiu · 2026-06-12T08:36:28Z

> @rujialiu Unfortunately there are multiple naming differences for model parameters and tensors that prevent antirez GGUFs from working with this PR.

Thanks for the reply. I'm especially interested in trying this REAP version in antirez's format, which (hopefully) is small enough for lower-end machines with only 64GB RAM:
https://www.modelscope.cn/models/0xSero/DeepSeek-V4-Flash-162B-GGUF

am17an · 2026-06-12T08:40:41Z

@fairydreaming sure, I think it makes sense to support already existing GGUFs. BTW can you check the latest commit for any perf improvements on your setup? Graph reuse was added across CSA boundaries

fairydreaming · 2026-06-12T09:25:05Z

@am17an Merged the changes and I see an improvement, TG in Flash now exceeds 20 t/s for short prompts (was around 18):

$ ./bin/llama-batched-bench -m ../../llama.cpp-dsv4/models/DeepSeek-V4-Flash.gguf -b 8192 -ub 8192 -npl 1 -npp 8192,16384,32768,65536,131072,262144,524288,1048064 -ntg 128 -fa 1 -cmoe --no-repack
0.00.464.041 W llama_model_loader: tensor overrides to CPU are used with mmap enabled - consider using --no-mmap for better performance

llama_batched_bench: n_kv_max = 1048576, n_batch = 8192, n_ubatch = 8192, flash_attn = 1, is_pp_shared = 0, is_tg_separate = 0, n_gpu_layers = -1, n_threads = 32, n_threads_batch = 32

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|  8192 |    128 |    1 |   8320 |   10.998 |   744.89 |    6.324 |    20.24 |   17.321 |   480.33 |
| 16384 |    128 |    1 |  16512 |   22.388 |   731.82 |    6.224 |    20.57 |   28.612 |   577.11 |
| 32768 |    128 |    1 |  32896 |   47.058 |   696.33 |    6.393 |    20.02 |   53.451 |   615.44 |
| 65536 |    128 |    1 |  65664 |  103.147 |   635.36 |    6.553 |    19.53 |  109.700 |   598.58 |
...

fairydreaming · 2026-06-12T16:05:23Z

@rujialiu OK, this is weird. I made a patch that allows antirez DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf that I downloaded some time ago to work in this PR, but your DeepSeek-V4-Flash-Spark-Mini-Q2-REAP-ds4.gguf for some reason causes CUDA error: an illegal memory access was encountered. The only difference between them is the number of experts, so it's extra weird. Still investigating.

Lowkey-Loki-SN · 2026-06-12T17:34:49Z

Not sure if it's too early for this but I'm noticing a consistently reproducible issue where the model outputs malformed JSX tags during long responses as follows:

return ( <
    section className = "hero"
    ref = { heroRef } >
    <
    span className = "hero__label label hero__animate" > Lumina < /span> {

Happens with both the raw unquantized Q8 GGUF and the quantized Q3 GGUF that I normally use but isn't reproducible with responses over the web/API.

Doesn't happen with short responses.

Repo used:

https://github.com/fairydreaming/llama.cpp, ds4 branch
Commit hash: abd1bee

Command used for HF -> GGUF:

python3 convert_hf_to_gguf.py \
  ../Models/HF/DeepSeek-V4-Flash/ \
  --outfile ~/AI/Models/GGUFs/DeepSeek-V4-Flash.gguf \
  --outtype q8_0 \
  --fp8-as-q8 \
  --use-temp-file

Command used for Quantization:

cat > "dsv4-flash-q3-robust.tensortypes" <<'EOF'
^blk\.[0-2]\.ffn_(gate|up|down)_exps\.weight$=mxfp4
^blk\.(3|42)\.ffn_down_exps\.weight$=mxfp4
ffn_down_exps=q3_K
ffn_gate_exps=q3_K
ffn_up_exps=q3_K
^token_embd\.weight$=q8_0
^output\.weight$=q8_0
indexer\.attn_q_b=q8_0
indexer=bf16
attn_comp=bf16
attn=q8_0
shexp=q8_0
nextn=q8_0
EOF

build/bin/llama-quantize \
  --allow-requantize \
  --tensor-type-file dsv4-flash-q3-robust.tensortypes \
  ../Models/GGUFs/DeepSeek-V4-Flash.gguf \
  ../Models/GGUFs/dsv4-flash-q3.gguf \
  Q3_K_S

Launch command:

CUDA_VISIBLE_DEVICES=1,0 build/bin/llama-server -m ~/AI/Models/GGUFs/dsv4-flash-q3.gguf -c 200000 -ngl 99 -fa 1 --jinja -np 1 --chat-template-file models/templates/deepseek-ai-DeepSeek-V4.jinja --no-mmap -ot ".ffn_(up|down)_exps.=CPU","([3-7]+).ffn_.*_exps.=CPU" -ts 0.46,0.54 --port 1234 -b 2048 -ub 2048

Prompt used for this:

Create a single-file HTML website (index.html) that serves as a high-end SaaS landing page for a next-generation developer product. Format: Single HTML file, No build tools, runs directly in the browser. Libraries: React 18, Babel Standalone, GSAP (Core + ScrollTrigger). Styling: Hand-written CSS only, No Tailwind. Design: Soft off-white background, near-black text, one accent colour. Modern sans-serif body, expressive display font, Tone: Editorial, minimal, no flashy effects, no glassmorphism. Layout: Full height hero, scroll-driven typography transformation, asymmetrical product philosophy grid. Signature Animation: One unforgettable visual movement (eg. system spanning into alignment). Final CTA. All motiopn must be scroll-linked

My Setup:

2x RTX 3080 20GB
Xeon 6148
128GB DDR4 2666hz

rujialiu · 2026-06-13T02:04:11Z

> @rujialiu OK, this is weird. I made a patch that allows antirez DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf that I downloaded some time ago to work in this PR, but your DeepSeek-V4-Flash-Spark-Mini-Q2-REAP-ds4.gguf for some reason causes CUDA error: an illegal memory access was encountered. The only difference between them is the number of experts, so it's extra weird. Still investigating.

@fairydreaming Thanks! I tried that REAP version with cchuter's branch i.e. https://github.com/cchuter/llama.cpp/tree/feat/v4-port-cuda which works with that DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf on my machine (tg ~4 tok/s, pp even slower). I also got:

CUDA error: an illegal memory access was encountered
D:\llama.cpp-cchuter\ggml\src\ggml-cuda\ggml-cuda.cu:108: CUDA error
  current device: 0, in function ggml_backend_cuda_synchronize at D:\llama.cpp-cchuter\ggml\src\ggml-cuda\ggml-cuda.cu:3327

I can't check whether this REAP gguf works with antirez's ds4 because ds4 doesn't support native Windows. I had good experience running Minimax 2.5 REAP with llama.cpp, but I don't have any way to ensure that gguf is sane (or at least works with official ds4). Sorry about that.

rujialiu · 2026-06-13T02:22:55Z

@fairydreaming OK, I found that cchuter's branch works with that REAP gguf (actually I tried a slightly larger 180B REAP gguf instead) with --device none to force CPU backend. I tried some non-trivial prompts and the output looks good. So probably the REAP gguf is good and the issue is caused somewhere outside this PR (because cchuter's branch also suffers from the same issue)

am17an · 2026-06-13T05:11:14Z

@Lowkey-Loki-SN I think it has something to do with tokenization; it messes up even small JSX templates for me. Mostly extra whitespace.

Lowkey-Loki-SN · 2026-06-13T06:21:12Z

> @Lowkey-Loki-SN I think it has something to do with tokenization; it messes up even small JSX templates for me. Mostly extra whitespace.

Glad to hear it's reproducible on your end too! And yes, it is always either extra whitespace or newlines when it happens on my end

fairydreaming · 2026-06-13T06:53:04Z

@rujialiu From what I see, the problem is that the expert indices read from the tid2eid tensors during quantize_mmq_q8_1() (which is called inside ggml_cuda_mul_mat_id()) are wrong (they should be from 0 to 143, but I see large numbers there). But I checked these tensors in the GGUF file with hexdump and found only values from 0 to 143 inside, so the GGUF seems to be OK. Perhaps there's some tensor data memory corruption going on. I have a small reproducible example of this error: it works fine with ubatch 31, but fails with ubatch 32.

Edit: @am17an is right, I disabled expert offloading (so all CUDA now) and now it works with ubatch 8 but fails with ubatch 9.
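The range check described above (every expert index must lie in 0 to 143 for this GGUF) can be sketched as a small host-side helper. This is an illustration only; `first_bad_expert_id` is a hypothetical name and not a function from llama.cpp or ggml:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical debugging helper: returns the position of the first expert id
// outside [0, n_expert), or ids.size() when every id is valid.
std::size_t first_bad_expert_id(const std::vector<int32_t> & ids, int32_t n_expert) {
    assert(n_expert > 0);
    for (std::size_t i = 0; i < ids.size(); ++i) {
        if (ids[i] < 0 || ids[i] >= n_expert) {
            return i; // corrupted or out-of-range entry found here
        }
    }
    return ids.size();
}
```

Running such a check on the indices copied back from the device, with `n_expert = 144` for the GGUF discussed here, would show whether the values are already wrong before the matrix multiplication consumes them.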

am17an · 2026-06-13T06:54:30Z

@fairydreaming ubatch 32 is when the offload would kick in, so probably something in cuda backend

ggerganov · 2026-06-28T16:37:57Z

Btw, I've noticed that sometimes the response goes inside the reasoning block:
Is this expected?

This continues to happen with the latest version.

am17an · 2026-06-28T16:40:20Z

> This continues to happen with the latest version.

What's your command to launch the server?

ggerganov · 2026-06-28T16:43:50Z

> > This continues to happen with the latest version.
>
> What's your command to launch the server?

make -j && ./bin/llama-server -m ./DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf --port 8014 -c 65536 --host 0.0.0.0 -lv 4

ggerganov · 2026-06-28T17:14:04Z

+    if (p0 > 0) {
+        // DSV4 compressed cache rows are derived from running compressor state,
+        // so arbitrary rollback is not reconstructible from the raw cache alone.
+        // Allow the common prompt-cache cleanup no-op: remove [end, infinity).
+        if (seq_id >= 0 && p0 > kv_raw->seq_pos_max(seq_id)) {
+            return true;
+        }
+
+        return false;
+    }


Without partial sequence removal, are we going to be able to support MTP?

We can still use a checkpoint and do MTP=1. The current partial state is just ~17 MB, so it should be possible to do something similar to what we do in Qwen for MTP > 1.
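The `p0 > 0` branch quoted above can be restated as a small standalone decision function. This is a toy sketch for reading the excerpt, not the PR's code: `toy_rm_decision` and the enum are hypothetical names, and the `p0 <= 0` path is not shown in the excerpt, so it is reported as not covered here rather than guessed.

```cpp
#include <cstdint>

// Hypothetical outcome type for the toy model below.
enum class rm_decision {
    no_op_ok,    // removal starts past the last stored position: nothing to undo
    rejected,    // partial rollback: compressed rows cannot be reconstructed
    not_covered  // p0 <= 0: handled by code outside the quoted excerpt
};

// Toy restatement of the quoted branch; seq_pos_max stands in for
// kv_raw->seq_pos_max(seq_id) from the excerpt.
constexpr rm_decision toy_rm_decision(int32_t seq_id, int32_t p0, int32_t seq_pos_max) {
    return p0 <= 0                            ? rm_decision::not_covered
         : (seq_id >= 0 && p0 > seq_pos_max)  ? rm_decision::no_op_ok
         :                                      rm_decision::rejected;
}

static_assert(toy_rm_decision(0, 100, 99) == rm_decision::no_op_ok, "cleanup past the end");
static_assert(toy_rm_decision(0,  50, 99) == rm_decision::rejected, "rollback into stored range");
```

Under this reading, dropping only the last few drafted tokens is rejected, which is why the question above about MTP arises.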

ggerganov · 2026-06-28T18:10:33Z

+    // When either raw or compressed state is per-sequence, split ubatches so
+    // every token maps cleanly to its stream. This may serialize independent
+    // non-unified sequences, but keeps compressed state ownership explicit.
+    do {
+        balloc.split_reset();
+
+        std::vector<llama_ubatch> ubatches;
+        while (true) {
+            llama_ubatch ubatch;
+            if (comp_coupled_same_set) {
+                ubatch = balloc.split_equal(n_ubatch, false);
+            } else if (comp_coupled) {
+                ubatch = balloc.split_seq(1);
+            } else if (comp_per_seq) {
+                ubatch = balloc.split_seq(n_ubatch);
+            } else {
+                ubatch = balloc.split_equal(n_ubatch, raw_per_seq);
+            }
+
+            if (ubatch.n_tokens == 0) {
+                break;
+            }
+            ubatches.push_back(std::move(ubatch)); // NOLINT
+        }
+
+        if (balloc.get_n_used() < balloc.get_n_tokens()) {
+            break;
+        }
+
+        if (auto ctx = make_context(std::move(ubatches))) {
+            return ctx;
+        }
+    } while (false);


I'm looking at the multi-sequence change (e16065f) and it seems that it does not accomplish the goal of properly supporting the non-unified KV cache. For context about how the non-unified KV cache should work, see #14363. In short, it requires ubatches with equal sequence lengths (i.e. split_equal).

However, the implemented logic always does split_seq. This is the correct thing to do when the non-unified KV cache is not supported by the graph. The idea is that when we use split_seq, we guarantee that each ubatch will only have tokens from a single sequence, so the graph does not need to handle multiple streams. For example, we do the same thing with the recurrent cache when using rollbacks, because the non-unified cache is currently not supported there either:

llama.cpp/src/llama-memory-recurrent.cpp, lines 419 to 423 in 2333185:

    if (n_rs_seq > 0) {
        // [TAG_RECURRENT_ROLLBACK_SPLITS]
        // TODO: recurrent state rollback does not support equal splits
        ubatch = balloc.split_seq(n_ubatch);
    } else {

This is a workaround, not the proper solution. If my understanding is correct, a lot of the new logic added in that commit is not necessary, because in the end we still use split_seq instead of the desired split_equal. Therefore we can simply work around this by using split_seq, as the recurrent cache does above, and avoid the extra logic.

In the future, we have to rework non-unified KV cache to be properly supported. I'm planning to do it for the recurrent memory first, so that Qwen3.6 runs faster with parallel sequences. For DS4 I was hoping we can start with the correct implementation from the beginning. But if it is too complicated, we can try to do it later.
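The difference between the two splitting strategies discussed above can be illustrated with a toy model. This is a sketch under strong simplifications and not llama.cpp's batch allocator: a batch is reduced to a token count per sequence, and `toy_ubatch`, `toy_split_seq`, and `toy_split_equal` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy ubatch: entry s is the number of tokens taken from sequence s.
using toy_ubatch = std::vector<std::size_t>;

// Single-sequence split: every ubatch holds tokens from exactly one sequence.
std::vector<toy_ubatch> toy_split_seq(std::vector<std::size_t> remaining, std::size_t n_ubatch) {
    assert(n_ubatch > 0);
    std::vector<toy_ubatch> out;
    for (std::size_t s = 0; s < remaining.size(); ++s) {
        while (remaining[s] > 0) {
            const std::size_t n = std::min(remaining[s], n_ubatch);
            toy_ubatch ub(remaining.size(), 0);
            ub[s] = n;
            remaining[s] -= n;
            out.push_back(ub);
        }
    }
    return out;
}

// Equal split: every ubatch takes the same number of tokens from each
// sequence that still has tokens left.
std::vector<toy_ubatch> toy_split_equal(std::vector<std::size_t> remaining, std::size_t n_ubatch) {
    assert(n_ubatch > 0);
    std::vector<toy_ubatch> out;
    while (true) {
        std::size_t n_active = 0; // sequences with tokens left
        std::size_t n_min    = 0; // shortest remaining length among them
        for (std::size_t r : remaining) {
            if (r == 0) continue;
            n_min = n_active == 0 ? r : std::min(n_min, r);
            ++n_active;
        }
        if (n_active == 0) break;
        // Equal share per active sequence, at least one token so we always progress.
        const std::size_t n = std::max<std::size_t>(1, std::min(n_min, n_ubatch / n_active));
        toy_ubatch ub(remaining.size(), 0);
        for (std::size_t s = 0; s < remaining.size(); ++s) {
            if (remaining[s] == 0) continue;
            ub[s] = n;
            remaining[s] -= n;
        }
        out.push_back(ub);
    }
    return out;
}
```

With per-sequence token counts {5, 3} and `n_ubatch = 4`, the single-sequence split yields {4, 0}, {1, 0}, {0, 3}, while the equal split yields {2, 2}, {1, 1}, {2, 0}. Only the second form gives the graph several streams of equal length to process together.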

github-actions bot added the model (Model specific), python (python script changes), and ggml (changes relating to the ggml tensor library for machine learning) labels Jun 5, 2026

github-actions bot added the script (Script related) and testing (Everything test related) labels Jun 5, 2026

fairydreaming mentioned this pull request Jun 6, 2026

ggml: add DeepSeek V4 hyperconnection + KV ops (CPU) #23122

Open

fairydreaming reviewed Jun 7, 2026

Comment thread scripts/gen-chat-inline-templates.py Outdated

ngxson reviewed Jun 11, 2026

Comment thread common/chat.cpp Outdated

am17an force-pushed the dsv4 branch from 415c6d0 to e586592 June 12, 2026 16:34

am17an and others added 22 commits June 29, 2026 00:21

reserve worst-case kv-cache

7b62f72

revert max split inputs

646ca19

address review comments

eb4854c

add padding to enable FA

6d78606

pad only the final value of plan.n_kv to 256

5c4870d

remove built-in cpp chat template

c0d0324

cont: remove cpp built-in template

c54dfe8

rm outdated test

aa56a11

replace ggml_view_3d() with ggml_reshape_3d()

7b2dfad

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

only support n_seq=1 for now

4ecddb8

remove unused var

73f8890

cont: remove unused var

e7ab657

use scale bias

6903be7

use correct ptr for can_reuse

3dcda81

remove gen-chat-inline-templates.py

26728ef

simplify graph reuse

5c18871

cont: cleanup

ea4dc4f

remove unused inputs

683fd55

enable partial checkpointing

bd54e8c

add correct shape for kq_mask + set llama_model_n_swa to 0 for dsv4

0994be7

precompute source_idx + add comment about dummy write

000c117

support multi-seq

e16065f

am17an force-pushed the dsv4 branch from fd0955f to e16065f June 28, 2026 16:27

ggerganov reviewed Jun 28, 2026

Comment thread src/llama-kv-cache-dsv4.h Outdated

remove restored_trim_pos

2333185

ggerganov reviewed Jun 28, 2026
